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Abstract 



We propose a non-parametric link prediction algorithm for a sequence of graph snapshots over 
time. The model predicts links based on the features of its endpoints, as well as those of the local 
neighborhood around the endpoints. This allows for different types of neighborhoods in a graph, 
each with its own dynamics (e.g, growing or shrinking communities). We prove the consistency 
of our estimator, and give a fast implementation based on locality-sensitive hashing. Experiments 
^ ' with simulated as well as three real-world dynamic graphs show that we outperform the state of 

, the art, especially when sharp fluctuations or non-linearities are present in the graph sequence. 

o 

1 Introduction 

I The problem of predicting links in a graph occurs in many settings - recommending friends in social 

networks, predicting movies or songs to users, market analysis, and so on. This has led to widespread 
interest in the problem. However, the state of the art methods suffer from two weaknesses. First, 
the literature mostly consists of heuristics (common neighbors, weighted common neighbors, counting 
paths, etc.) which, while working very well in practice, have not been thoroughly analyzed in terms 
of their underlying models, asymptotic consistency guarantees, rates of convergence to the asymptotic 
I estimates, etc. ([19J is one step in this direction). Second, almost all the heuristics are meant for 

predicting links from one static snapshot of the graph. However, graph datasets often carry additional 
temporal information such as the creation and deletion times of nodes and edges, so the data is better 
viewed as a sequence of snapshots of an evolving graph. The historical evolution of the graph is usually 
ignored, apart from a few studies such as |15| . We focus on link prediction in such dynamic graphs, and 
propose a non-parametric method that (a) makes no model assumptions about the graph generation 
process, (b) leads to formal guarantees of consistency, and (c) offers an extremely fast and scalable 
implementation via locality sensitive hashing (LSH). 

Our approach falls under the framework of non-parametric time series prediction, which models 
the evolution of a sequence xt over time [16| . Each xt is modeled as a function of a moving window 
(xt-i, . . . , xt-p), and so xt is assumed to be independent of the rest of the time series given this window; 
the function itself is learnt via kernel regression. In our case, however, there is a graph snapshot in each 
timestep. The obvious extension of modeling each graph as a multi-dimensional Xt quickly runs into 
problems of high dimensionality, and is not scalable. Instead, we appeal to the following intuition: the 
graphs can be thought of as providing a "spatial" dimension that is orthogonal to the time axis. In the 



1 



spirit of the time series model discussed above, our model makes the additional assumption that the 
linkage behavior of any node i is independent of the rest of the graph given its "local" neighborhood 
or cluster N{i); in effect, local neighborhoods are to the spatial dimension what moving windows 
are to the time dimension. Thus, the out-edges of i at time t are modeled as a function of the local 
neighborhood of i over a moving window, resulting in a much more tractable problem. This model also 
allows for different types of neighborhoods to exist in the same graph, e.g., regions of slow versus fast 
change in links, assortative versus disassortative regions (where high-degree nodes are more/less likely 
to connect to other high-degree nodes), densifying versus sparsifying regions, and so on. An additional 
advantage of our non-parametric model is that it can easily incorporate node and link features which 
are not based on the graph topology (e.g., labels, in labeled graphs). 

Our contributions are as follows: 
(1) Non-parametric problem formulation: We offer, to our knowledge, the first non-parametric model 
for link prediction in dynamic graphs. The model is powerful enough to accommodate regions with very 
different evolution profiles, which would be impossible for any single link prediction rule or heuristic. 
It also enables learning based on both topological as well as other externally available features (such 
as labels). 

(2) Consistency of the estimator: Using arguments from the literature on strong mixing, we show 
mean square consistency of our estimator. The closest analysis akin to ours is the analysis of subsam- 
pling from random fields |17j . which also makes similar assumptions. 

(3) Fast implementation via LSH: Non-parametric methods such as kernel regression can be very 
slow when the kernel must be computed between a query and all points in the training set. We adapt 
the locality sensitive hashing algorithm of [11' for our particular kernel function, which allows the link 
prediction algorithm to scale to large graphs and long sequences. 

(4) Empirical improvements over previous methods: We present our empirical results in two parts. 
First we show that on graphs with non-linearities, such as seasonally fluctuating linkage patterns, we 
outperform all of the state-of-the-art heuristic measures for static and dynamic graphs. On real-world 
datasets whose evolution is far smoother and simpler, we perform as well as the best competitor. 

The rest of the paper is organized as follows. We present the model and prove consistency in 
Sections [D and |31 We discuss our LSH implementation in Section We give empirical results in 
Section [5l followed by related work and conclusions in Sections |6] and [T) 

2 Proposed Method 

Let the observed sequence of directed graphs heQ = {d, G2, . . . , Gt}. Let Gt^p = {Gt,Gt~i, . . . , Gt-p+i}- 
A naive extension of prior work on time series would model Gt+i as a stochastic function of Gt^p. How- 
ever, this has very high dimensionality and is difficult to scale. 

One solution is to learn a local prediction model for each possible edge i — t- j using some features, 
and patch these predictions together to form the entire graph. However, the features have to be chosen 
carefully. If they only encode narrow relationships between i and j (e.g., their degrees and number of 
common neighbors), the predictor cannot adapt to local variations in linkage patterns (e.g., different 
communities in the graph behaving differently). On the other hand, adding in a large number of 
features describing properties of the entire graph would again render the problem unscalable. Instead, 
we propose a model that combines some features specific to i ~^ j along with others that describe the 
local neighborhood of i. This keeps the dimensionality under control while still being powerful enough 
to distinguish between local variations. 

Specifically, let Yt{i,j) = 1 if the edge i ^ j exists at time i, and let Yt{i,j) = otherwise. Let 
Nt{i) be the local neighborhood of node i in Gt] in our experiments, we define it to be the set of 
nodes within 2 hops of i (up to a maximum of some large value A), and all edges between them. Note 
that the neighborhoods of nearby nodes can overlap. Let Nt.p{i) — {Nt{i), . . . , Nt-p+i{i)} . Then, our 
model is: 

Yt+i{i,j)\g - BcT{g{tpt{i,j))), i^t{i,j) = {st ii,j),dt (i)} 
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where < g{.) < 1 is a function of two sets of features: those specific to the pair of nodes under 
consideration (st and those for the local neighborhood of the endpoint i (dt («)). We require that 

both of these be functions of Nt,p{i)- Thus, Yt+i{i,j) is assumed to be independent of Q given Nt^p{i), 
hmiting the dimensionahty of the problem. Also, two pairs of nodes and {i',j') that are close to 
each other in terms of graph distance are likely to have overlapping neighborhoods, and hence higher 
chances of sharing neighborhood-specific features. Thus, link prediction probabilities for pairs of nodes 
from the same graph region are likely to be dependent, which agrees with intuition. 

Next, we will discuss the forms of the features and the estimator. Then we discuss some imple- 
mentation details and extensions for very sparse graphs. The asymptotic consistency of our estimator 
will be shown in the next section. 

Features. Let St {i,j) be any pair-specific feature mapping; we only assume that the range of St {i,j) 
is a finite set S. Let dt (i) — {r/n (s) ,7]^^ (s) Vs G S}, where r/n (s) are the number of node pairs in 
Nt-i{i) with feature vector s, and 77^ (s) the number of such pairs which were also linked by an edge 
in the next timestep t. Thus, dt (i) tells us the chances of an edge being created in t given its features 
in i — 1, averaged over the whole neighborhood Nt^i{i) — in other words, it captures the evolution 
of the neighborhood around i over one timestep. The intuition behind this choice of dt {i) is that 
past evolution patterns of a neighborhood are the most important characteristics in predicting future 
evolution. In the following, we often refer to each feature s as the "cell" s with contents {r/it (s) , 77^ (s)), 
and dt{i) as the "datacube" of all these cells. 
Estimator. Our estimator of the function g{.) is: 

E^'rt'SMM^.J).^t'{^',f))■Yt,+,i^',f) 

9m[t,J))^ ^-^^ a- / I /■ ■\ I i-i ■/^^ ■ ^) 

Ei',j',t' Sim(^/>t (i, j), V't' («', j')) 

To reduce dimensionality, we factor Sim(.) into pair-specific and neighborhood-specific parts: 

SMMhj),^t'(i',3')) = K{dt (i),dt- ii')) ■ I{st = St' (2) 

In other words, the similarity measure Sim(.) computes the similarity between the two neighborhood 
evolutions (i.e., the "datacubes" ) , but only for pairs {i',j') that had exactly the same features as the 
query pair at time t (i.e., pairs belonging to the "cell" s = 8t{i,j)). Combining Equations [T] 

andEJ we can get a different interpretation of the estimator: 

J:^',t'K{dt{^),dt>{^'))■J:r[I{^t{^,J)^St,{^',f)}■Yt,M^',f)] 

giMhj)) = — 



Y.,,t' K {dt {i) , dv (^')) • E,' Ibt (», j) = se (z', 3')} 

Y.,,fK{dt {i).dt, {i'))-4t'+ii^t (»,j)) 

E^',t' K {dt {i) , df {i')) ■ iwt'+i (si {i, j)) 



(3) 



Intuitively, given the query pair (i, j) at time t, we look only inside cells for the query feature s = St {i, j) 
in all neighborhood datacubes, and compute the linkage probability rj^^, (s) /riiif (s) in these cells; 
these probabilities are then averaged by the similarities of the datacubes to the query neighborhood 
datacube. Thus, the probabilities are computed from historical instances where (a) the feature vector 
of the historical node pair matches the query, and (b) the local neighborhood was similar as well. 

Now, we need a measure of the closeness between neighborhoods. Two neighborhoods are close if 
they have similar probabilities p{s) of generating links between node pairs with feature vector s, for 
any s € S. We could simply compare point estimates p{s) = ry"*" (s) /ry. (s), but this does not account 
for the variance in these estimates. Instead, we consider a normal approximation to the posterior of 
p{s), and use the total variation distance between these normals as a measure of the closeness. More 
precisely, let ps ^ r^tt (*) hit (s) and p(, = vj^^, {s) /r]^'t' («)• Then, 

K{dt{i),dt'{i')) = 6^('i*WA'(»')) (0 < 6 < 1) 
whe,e,W(.).*,(0) ^ (gTv(x(p..?^),x(p;,£ii^))] (4) 
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where TV(., .) represents the total variation distance between its two argument distributions, and < 
6 < 1 is a bandwidth parameter such that K{dt («) , df (i')) — ?> as ^> unless D {dt (i) , df {i')) = 0. 
Implementation Details and Extensions. While our theoretical results (presented in the next 
section) apply to any feature vector St (*, j) that has finite support S, the precise form is clearly of 
significant practical interest. Choosing features that were found to be useful in prior work [211 US], 
we set St{i,j) — {degreef{i),degreef{j),cnt{i,j),£it{i,j)}, where cnt(i,j) represents the number of 
common neighbors of i and j at time t, and ££t{i,j) the number of timesteps elapsed since the edge 
i ^ j last occurred in Gt^p] this is if the edge occurs in time t, and is set to p if it never occurs 
in Gt,p- Features based on node and edge labels can also be used here. All feature values are binned 
logarithmically in order to combat sparsity in the tails of the feature distributions (this is particularly 
true for degrees, which are commonly observed to follow power laws). 

If the graphs are too sparse, then two practical problems can arise. First, a node i could have zero 
degree and hence an empty neighborhood. In such cases, we extend the definition of neighborhood 
to include nodes j that were connected to i in the previous p timesteps (p > 1). Note that the 
datacube dt (i) still remains a function of N^{i) as required by the model. Second, r], (s) and r/+ (s) 
could be zero for many different s G 5*. Hence, many historical neighborhoods that are close to the 
query neighborhood might not have any data for the query feature s — St (*, j), rendering them useless 
for estimation purposes. In such cases, we combine 77. (s) and 77^ (s) with a weighted average of the 
corresponding values for any s' that are "close" to s, the weights encoding the similarity between 
s' and s. This ensures that some estimates are available even in extremely sparse portions of the 
neighborhoods. 



3 Consistency of Kernel Estimator 

Now, we prove that the estimator g defined in Eq. [3]is consistent. Recall that our model is as follows: 

Yt+,{i,j)\g ^BeiigiMhM, Mh j) ^ {i, j) , dt (t)} (5) 

For simplicity, assume that the number of nodes in a graph at all timesteps is n. From Eq. |3l the 
kernel estimator of g for query pair (q, q') at time T can be written as: 

g{s,dT{q)) = y^^''^,^^^^! (where s = ST (g, g')) 
/(s,dT (q)) 

^ T-2 n 

^^'^^^'^^^^ = ^7:f—Y)T.T.^'>{dt{i),dT{q))vtt+l{s) 

P ) t=p i=l 

^ T-2 n 

f{s,dTiq)) = — —y2y2^>>{dtii),dTiq))^]^t+l{s) 

^ ^ ' t=p 1=1 

The estimator g is defined only when / > 0. The kernel was defined earlier as Kb{dt [i) , {q)) = 
j^D{dt{i),dT{q)) ^ where h is the bandwidth which tends to zero as the total number of datapoints approach 
infinity, and D{.) is the distance function defined in Eq. |4l This is similar to other discrete kernels jU 
[22] . and, analogous to kernels in TZ, this kernel has the following propertj{l|: 

,. (A (■\ ,^ I \\ liD{dt{i),dT{q))^QA-(i:dt{i)^dT{q) , . 

\imKi[dt{i),dT{q))^ { . (6) 

b^o Otherwise 



^This can be shown from our definition of D{.) and assumption (Al) discussed later. 
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Theorem 3.1 (Consistency). If E[f] > 0, theng is a consistent estimator of g, i.e., 'g — > g. 

Proof. The condition E[f] > is necessary to ensure that enough data matching the query is seen 
for the estimates to converge. This corresponds to the assumption in traditional kernel regression of 
requiring that the true density is strictly positive. The proof is in two parts. Define B — {E[h] — 
gE[f])/E[f]. We can show that g-g-B = {l/f){(h - E[h]) ~ {g + B){f- E[f])). Thm. [X^ wiU show 

that B = o(l), and then Thm. I3.5l that g — g — B — > 0. Together, they imply that g ~ g — > 0, thus 
proving that g is a consistent estimator of g. □ 

We introduce the first two of our assumptions here. (Al) Finite dimensionality: The graphs have 
a bounded intrinsic dimensionality p that is independent of n and T. Consequently, our two-hop 
neighborhood sizes are also bounded by some value A. (A2) Bandwidth convergence: The kernel 
bandwidth parameter 5 — >■ as uT — > co. 

Theorem 3.2. limb^o B = 0. Since 6 —5- as uT — > cxd (by assumption (A2)), then this implies that 
B = o{\). 

Proof. For t G [p, T — 2]; « G [1, A^]; s = st ((?, q'), the numerator of B is an average of the terms: 
E [Kb{dt {i) , dT [qMt+i (5)] ~ E [Kt{dt (i) , dr {q)Ht+i is)] dr (g)). 

Taking expectations w.r.t. dt (i), and denoting Kb{dt (i) ,dT (q)) by 7, the first term becomes: 

E [ivtt+i (*)] = E [jE [rj+^, is) \dt (z)]] = E [77;,*+! (s) ■ g{s, dt (z))] 

where the last equality follows from the fact that E [rift_^_i (s) \dt (i)] ~ Vit+i (s) • g{s, dt (i)), as can be 
seen by summing Eq. [5]over all pairs {i,j) in a neighborhood with identical St {i,j), and then taking 
expectations. Thus, the numerator of B becomes an average of terms of the following form: 
E [Kb{dt (i) , dr (g))'7it+i (s) ■ (ff (s, dt (i)) — g{s, dr (q)))] ■ This expectation is over all possible configura- 
tions of the neighborhoods Nt{i) and Nt+i{i). Since our neighborhood sizes are bounded by A which 
is constant w.r.t N and T (assumption (Al)), the expectation is a finite sum of terms (independent of 
N or T). 

We now let 5 — ?> 0. Since the expectation is over a finite sum, so we can take the limit inside 
the expectation. Since limb^o Ki,{dt (i) , (q)) is zero unless dt{i) — dr (q) (Eq. [5]), we find that 
E [Ki,{dt (?) , dx {q))riit+i (s) • {g{s, dt (i)) — g{s, dr (<?)))] 0, for any i and t. So the numerator of B 
goes to zero asymptotically. Hence, the estimator g is unbiased. □ 

Before proving the second part of Thm. 13. 1[ we prove the following lemmas. 

Lemma 3.3. Let q{i,t,t + 1) be a bounded function of rjit+i {s),ri^^-^ (s) and dt (i). Then, under 



assumption (A3) below, var 



Tn 



Proof. The variance can be expressed as: 



as nT — >■ cxd. 



(nT)-2 ^^^^ var[g(z, t, t + I)] + ^ cov [q{i, t,t + l), q{i' , t',t' + 1)] 

Since q is bounded, so are the variances, so the total contribution from the variances is 0(l/nT). 
Among the covariance terms, some pairs {i, t,t + \}, {i' , t' ,t' + 1} will share random variables, whereas 
the others will be disjoint. For example, if j G Nt{i), then Nt{i) intersects Nt{j) as well. Similarly, 
the datacubes summarizing the evolution of i from t —i' t + 1 and from t — 1 — > t are also overlapping. 
Consider the graph obtained by "stacking" up all the individual graphs, such that every node i is now 
also connected to node i at time t ~ I and time t -\- 1 (we consider a lag of p = 1; the extension to 
arbitrary^ is simple). Denote by dist{{i,t,t+l}, {i' , t' , t' + 1}) the length of the shortest path between 
any pair of nodes in the datacubes dt (?) and df («')■ For example, with i' — i and i G — l,t -f 1] 
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this distance is zero, whereas for t' = t + 2, dist is one, and so on. We will now split the covariances 
into two groups, the pairs at dist zero and those at dist > 0. The number of overlapping datacubes 
with a given datacube is upper-bounded by a constant since the neighborhood size A is bounded 
(assumption (Al)), and the lag p in the time series is a constant w.r.t. T. Also since q is bounded, 
cov(g(i, t,t + l),q(i', t', t' + 1)) is bounded by some constant. Hence, 

JtA^ ^ cov[g(z,t,t + l),g(z',t',t' + l)]) <-g; (7) 

\ td {t',i'|dist=0} / 

For the remaining covariance terms, we introduce our final assumption. (A3) Strong mixing: 
Define the strong mixing coefficients = supj^^f^i^^^^,f^i,^{\P{Ar\B) — P{A)P{B)\ : A & T{dt {i)),B £ 
J-{dt' {i')), dist {{i,t,t + 1}, {i',t',t' + 1}) > k}, where J^{dt (i)) is the sigma algebra generated by the 
random variables in the datacube for i,t. Then, following [17], we assume that X^feLi Oi{k)k''~^ < oo. 

We will use the well known mixing inequality for covariances of bounded random variables (see [3]): 
cov [q{i, t, t + 1), q{i', t' , t' + 1)] < C" x a(dist({i, t, t + 1}, {i' , t' , t' + 1})). 

Let A^(i,t,dist) denote the total number of pairs {i',t') at distance dist from {i,t). Note that for a 
given {i,t) this part of the covariance can be written as a sum over dist. Now using assumption (A3) 
and Lemma 13.41 below, we obtain: 

\ / \ 

(8) 



1 / \ r"' / \ r"" 

E ^' X «(dist)iV(M,dist) < E "(dist)dist''-i < — 

t,i \dist>0 / tA \dist>0 / 



Thus, all terms in the initial variance expression are 0(l/(nT)), which tends to zero. □ 

Lemma 3.4. For the "stacked" time series graph with lag p (which is a constant w.r.t T) built from a 
sequence of temporal graphs with bounded intrinsic dimensionality p, we have: N(i,t, dist) — 0(p(4 + 
dist)P-^) = OidistP-^). 

Proof. This follows by a direct application of the definition of intrinsic dimensionality in |14| , and the 
fact that we are using 2-hop neighborhoods. □ 

Theorem 3.5. // E[f] > 0, then g - g ~ B ^ 0. 

Proof. Since our neighborhood sizes are bounded (assumption (Al)), rj^^^ (s) and 77^ (s) are also 
bounded. By definition, g is a probability and is bounded, as is the kernel Ki,. Using Lemma 1531 with 
q{.) = Kb{dt {i),dT {q))r]it+i (s) and q{.) = Kb{dt (i) .dr {q)){'rtu+i (*) " (ff + B)riit+i (s)), we see that 
f — > E[f] and {[h — {g + B)f] — E[h — {g + B)f]) — > 0. Since convergence in quadratic mean implies 
convergence in probability, we have: 

(/, [h-{g + B)f] - E[h - (.9 + B)f]) A {E[f],0). 

If E[f ] > 0, we can use the Continuous Mapping theorem on the function f{X,Y) ~ Y/X to obtain 
the result. Note that this function is continuous at every point of a set C, s.t. P{{X,Y} G C) = 1, 
since Y = E[f] > 0. This completes the proof for Thm. □ 



4 Fast search using LSH 

A naive implementation of the non-parametric estimator (eq[3|) searches over all n datacubes for each 
of the T timesteps for each prediction, which can be very slow for large graphs. In most practical 
situations, the top-fc closest neighborhoods should suffice (in our case k = 100). Thus, we need a fast 
sublinear-time method to quickly find the top-fc closest neighborhoods. 
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We achieve this via locahty sensitive hashing (LSH) [TT]. The key step is to devise a hashing 
function for neighborhoods such that neighborhoods that are close according to our similarity function 
are likely to be mapped to the same hash bucket. Recall that the similarity between two datacubes is 
the sum of the total variation distances between the distributions in each corresponding cell. We now 
represent the distribution in each cell as a sequence of bits, so that the Hamming distance between 
the bit representations of two distributions approximates their total variation distance. LSH schemes 
from [11' for hamming distances can then be used. 

Conversion to bit sequence. The key idea is to approximate the linkage probability distribution 
by its histogram. We first discretize the range [0, 1] (since we deal with probabilities) into Bi buckets, 
and compute the probability mass that falls in each bucket. To encode the probability mass, each 
bucket is encoded in B2 bits; if the probability mass in the bucket is p, then the first [p ■ bits are 
set to 1, and the others to 0. Thus, the entire distribution (i.e., one cell) is represented by Bi ■ B2 bits. 
The entire datacube can be stored in \S\ ■ Bi ■ B2 bits. However, in all our experiments, datacubes 
were very sparse with only M ^ 15*1 cells ever being non-empty (usually, 10-50); thus, we use only 
M ■ Bi ■ B2 bits in practice. 

We use this particular representation because the total variation distance between discrete distri- 
butions is half the Li distance between the corresponding probability mass functions. The Li distance, 
in turn, is just the Hamming distance between the bit representations above. Thus the distance be- 
tween two datacubes can be approximated by computing the Hamming distance between two pairs of 
M ■ Bi ■ B2 bit vectors, using the following LSH scheme. 

Distances via LSH. We create a hash function by just picking a uniformly random sample of k bits 
out of M-Bi-B2- For each hash function, we create a hash table that stores all datacubes whose hashes 
are identical in these k bits. We use £ such hash functions. Then, given a query datacube, we hash it 
using each of these £ functions, and then create a candidate set of up to 0(max(£, top-k)) of distinct 
datacubes that share any of these £ hashes. The theoretical guarantees on finding the nearest neighbors 
in this set with high probability can be found in jll] . The total variation distance of these candidates 
to the query datacube is computed explicitly, yielding the closest matching historical datacubes. 

Finally, we underscore a few points. First, the entire bit representation oi M ■ Bi ■ B2 bits never 
needs to be created explicitly; only the hashes need to be computed, and this takes only 0{k ■ £) time. 
Second, the main cost in the algorithm is in creating the hash table, which needs to be done once as a 
preprocessing step. Query processing is extremely fast and sublinear, since the candidate set is much 
smaller than the size of the training set. Finally, we have found the loss due to approximation to be 
minimal, in all our experiments. 

5 Experiments 

We first introduce several baseline algorithms, and the evaluation metric. We then show via simulations 
that our algorithm outperforms prior work in a variety of situations modeling non-linearities in linkage 
patterns, such as seasonality in link formation. Finally, we demonstrate high link prediction accuracy 
on real-world dynamic graphs as well, in particular, the co-authorship graphs in the Physics and NIPS 
communities, and the "machine learning" community according to Citeseer. 

Baselines and metrics. We compare our non-parametric link prediction algorithm to the following 
well-known baselines, each of which gives a ranking of node pairs in terms of their probability of forming 
a link in the next timestep: (a) LastLink (LL), which orders node pairs according to the last time when 
they were linked, with more recently linked pairs being ranked higher |21] . (b) Common Neighbors 
(CN), which gives higher ranking to pairs with more common neighbors in the last timestep [15] . (c) 
Adamic/Adar (AA), which considers common neighbors weighted by their degrees |2], and (d) Katz 
measure (Katz), which extends CN to longer paths but gives exponentially smaller weights to such 
paths [13. We compare these against the non-parametric link prediction algorithm described above 
(Non-Param), and also against a random ranker (Random). 

For any graph sequence (Gi, . . . , Gt), we test link prediction accuracy on Gt for a subset S'>o of 
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Data 


Non-Param 


LL 


CN 


AA 


Katz 


Random 


Seasonal (T = 


10,0 = 0.5) 


0.82 


0.65 


0.51 


0.51 


0.51 


0.50 


Stationary (T 


= 10,0 = 0.5) 


0.97 


0.94 


0.88 


0.88 


0.97 


0.50 


Seasonal (T = 


4) 


0.41 


0.58 


0.46 


0.46 


0.45 


0.49 


Seasonal (T = 


12) 


0.88 


0.68 


0.50 


0.50 


0.50 


0.50 


Seasonal (T = 


24) 


0.89 


0.72 


0.52 


0.52 


0.52 


0.50 


Seasonal (0 = 


0.3) 


0.83 


0.65 


0.50 


0.50 


0.50 


0.50 


Seasonal (0 = 


0.05) 


0.77 


0.68 


0.47 


0.47 


0.47 


0.50 



Table 1: Average AUG on simulated graphs 



nodes with non-zero degree in Gt- Each algorithm is provided training data until timestep T — 1, and 
must output, for each node i G S'>o, a ranked list of nodes in descending order of probability of linking 
with i in Gt- For purposes of efficiency, we only require a ranking on the nodes that have ever been 
within two hops of i; all algorithms under consideration predict the absence of a link for nodes outside 
this subset. Now, for each i G 5'>o, we compute the AUG score for the predicted ranking against the 
actual edges formed in Gt, and we report the mean AUG over S^q. 

5.1 Simulations 

One unique aspect of Non-Param is its ability to predict even in the presence of sharp fluctuations in 
the data. As an example, we focus on seasonal patterns. While co-authorship graphs have a minor 
seasonality component, our experiments on such graphs use a longer duration per timestep that smears 
out the seasonality, to be fair to the competing algorithms. Hence, the effects of seasonality are best 
shown via simulations, described next. 

We simulate a model of Hoff et al. 0, that posits an independently drawn "feature vector" for each 
node. Time moves over a repeating sequence of seasons, with a different set of features being "active" 
in each. Nodes with these features are more likely to be linked in that season, though some links are 
also formed due to noise. There are two parameters: the number of timesteps T, and the strength 
of the seasonality (seasonal linkages become likelier as (j) increases). All graphs have 50 nodes, and 
results are averaged over multiple runs. 

Effect of seasonality. Glearly, Seasonal has nonlinear linkage patterns; the best predictor of links 
at time T are the links at times T — 3, T — 6, and so on. However, all the baseline algorithms are 
biased towards predicting links between pairs which were linked (or had short paths connecting them) 
at the previous timestep T — 1; this implicit smoothness assumption makes them suffer heavily. On 
the other hand, for Stationary, links formed in the last few timesteps of the training data are good 
predictors of future links, and so the difference between the algorithms is marginal. We note, however, 
that Non-Param performs very well in all cases. 

Effect of longer history. For T = 4, Non-Param performs poorly, because the training data 
includes only the shift from seasons 1 to 2, and 2 to 3, but not from 3 to 1 again, and this is what it 
needs to predict. The error is rectified as T increases, with performance eventually flattening out near 
T — 24. LL also improves, simply because it sees more links from previous instances of the test season; 
edges seen in other seasons hurt its performance, but these are relatively sparse. As before, CN, Katz, 
and A A are mostly unaffected: they use only the structure of the graph in the last timestep to predict 
future links. 

Effect of increased block weight. Finally, we vary the importance of the block in determining 
links by testing several values of the seasonality strength 0. As it decreases, Non-Param is hurt more. 
This is expected, since most edges are now being created due to baseline noise. 
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Data 


Nodes 


Total Edges 


Non-Param 


LL 


CN 


AA 


Katz 


Random 


NIPS 


2,865 


4, 740 


0.85 


0.84 


0.68 


0.70 


0.83 


0.51 


HepTh 


14,737 


24, 534 


0.87 


0.86 


0.67 


0.69 


0.81 


0.49 


Citeseer 


20,912 


36, 245 


0.82 


0.85 


0.63 


0.64 


0.68 


0.50 



Table 2: Average AUC on real-world graphs. 



5.2 Real- world dynamic graphs 

The simulation experiments above demonstrate the utility of non-parametric link prediction especially 
in the presence of non-linearities such as seasonality. Now, we show that the accuracy of this method 
extends even to real- world dynamic graphs, for which the baseline algorithms are among the best 
known. In particular, we perform experiments on the co-authorship graphs of the subset of the Physics 
community ( "HepTh" ) [1] , authors in NIPS [6] , and authors of papers indexed by Citeseer that contain 
"machine learning" in their abstracts. Each timestep looks at 2 years of papers (1 year for Citeseer), 
with nodes being linked if they were co-authors on any paper in that time period. 

Table [2] shows the average AUC for all the algorithms. Non-Param and LL are better than all other 
algorithms; between the two, Non-Param is better on NIPS and HepTh. This is expected, since LL has 
been shown to perform very well in many real- world graphs |21| , probably because their evolution over 
time tends to be smooth and linear (seasonality effects are not important in co-authorship graphs). 

6 Related Work 

Link prediction in static networks is a well studied problem, whereas link prediction in evolving net- 
works is slowly becoming important with the growing popularity of large social networks (LiveJournal, 
Facebook) networks; evolving citation networks (Citeseer, DBLP) etc. Existing work on link pre- 
diction in dynamic social networks can be broadly divided into two categories: generative model 
based [20l H El [Hig and graph structure based [T0l[2T]. 

Generative models. These include extensions of Exponential Family Random Graph models [7] 
by using evolution statistics like "edge stability", "reciprocity", "transitivity"; extension of the latent 
space models for static networks [18] by allowing smooth transitions in latent space [2^ , and extensions 
of the mixed membership block model to allow a linear Gaussian trend in the model parameters [S] . In 
other work, the structure of evolving networks is learnt from node attributes changing over time. This 
includes nonparametric estimation of time-varying Gaussian random fields with a smoothly changing 
covariance matrix [23] and its discrete analog [13] where the graph structure of time- varying Markov 
Random Fields are estimated. Although these models are both intuitive and rich, they generally a) 
make strong model assumptions, b) require computationally intractable posterior inference, and c) 
explicitly model linear trends in the dynamics of the networks. 

Models based on structure. Huang et al. [10] proposed a linear autoregressive model for individual 
links, and also built hybrids using static graph similarity features. In |21j the authors examined simple 
temporal extensions of existing static measures for link prediction in dynamic networks. In both of 
these works it was empirically shown that LL and its variants are often better or among the best 
heuristic measures for link prediction. Our non-parametric method has the advantage of presenting a 
formal model, with consistency guarantees, that also performs as well as LL. 

7 Conclusions 

We proposed a non-parametric model for link prediction in dynamic graphs, and showed that it per- 
forms as well as the state of the art for several real-world graphs, and exhibits important advantages 
over them in the presence of non-linearities such as seasonality patterns. We are currently working 
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on longitudinal econometric datasets where seasonality patterns are more marked. Non-Param also 
allows us to incorporate features external to graph topology into the link prediction algorithm, and its 
asymptotic convergence to the true link probability is guaranteed. Together, these make a compelling 
case for Non-Param being a useful tool for link prediction in dynamic graphs. 
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